{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c1ca65de",
   "metadata": {},
   "source": [
    "# Ay 119 - Intro to ML: regression\n",
    "\n",
    "This notebook presents the main ideas discussed today using regression as an introduction to machine learning.\n",
    "\n",
    "The execises cover:\n",
    "\n",
    "1. Photometric redshift with linear and multiple regression  \n",
    "2. Stellar temperature from color with polynomial regression and cross-validation  \n",
    "3. Weighted regression with heteroscedastic errors  \n",
    "4. Robust regression in the presence of outliers  \n",
    "5. Gaussian process regression for an irregularly sampled light curve  \n",
    "6. A second Gaussian process exercise: periodic vs. non-periodic kernels  \n",
    "\n",
    "For each exercise, you will be asked to provide an interpretation of your results."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8fa43ff",
   "metadata": {},
   "source": [
    "## Suggested setup\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3ecdadd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from sklearn.model_selection import train_test_split, cross_val_score\n",
    "from sklearn.linear_model import LinearRegression, HuberRegressor, LogisticRegression\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n",
    "from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix, classification_report\n",
    "from sklearn.gaussian_process import GaussianProcessRegressor\n",
    "from sklearn.gaussian_process.kernels import (\n",
    "    RBF,\n",
    "    WhiteKernel,\n",
    "    ConstantKernel as C,\n",
    "    ExpSineSquared\n",
    ")\n",
    "\n",
    "rng = np.random.default_rng(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6934393e",
   "metadata": {},
   "source": [
    "## 1. Photometric redshift with linear and multiple regression\n",
    "\n",
    "### Problem\n",
    "A common astronomy problem is to estimate the redshift of a galaxy from broadband photometry. We'll simulate a galaxy sample with magnitudes in three bands and use them to predict redshift.\n",
    "\n",
    "To do:\n",
    "1. Fit a model using one color only.\n",
    "2. Fit a model using multiple colors.\n",
    "3. Compare train/test MSE and $R^2$.\n",
    "4. Inspect the residuals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c64e3e16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simulate galaxy photometry and redshift\n",
    "n = 600\n",
    "z = rng.uniform(0.0, 1.2, n)\n",
    "\n",
    "g = 21.5 + 2.2 * z + rng.normal(0, 0.18, n)\n",
    "r = 20.9 + 1.5 * z + rng.normal(0, 0.15, n)\n",
    "i = 20.5 + 1.0 * z + rng.normal(0, 0.15, n)\n",
    "\n",
    "# use train_test_split to create the test and training data sets\n",
    "\n",
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "caa4c7bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot residuals as a function of redshift\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be6fbc24",
   "metadata": {},
   "source": [
    "### Interpretation\n",
    "\n",
    "TO BE FILLED IN"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2072279",
   "metadata": {},
   "source": [
    "## 2. Stellar temperature from color: polynomial regression and model selection\n",
    "\n",
    "### Problem\n",
    "A star's color is related to its effective temperature, but here we deliberately simulate a relation with strong curvature and a local bend. A straight line should fail badly, while an intermediate-order polynomial should work much better.\n",
    "\n",
    "1. Fit polynomial models of varying degree.\n",
    "2. Use cross-validation to choose the degree.\n",
    "3. Compare the fitted curves.\n",
    "4. Compare the test-set MSE for the linear model and the best polynomial model.\n",
    "5. Explain why the best degree model is not automatically the best even if it looks flexible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eec70b2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "n = 340\n",
    "color = rng.uniform(-0.4, 2.2, n)\n",
    "\n",
    "# Construct a strongly curved relation with an inflection and a local feature\n",
    "Teff_true = (10300 - 6500 * color\n",
    "            + 3200 * color**2\n",
    "            - 850 * color**3\n",
    "            + 900 * np.exp(-0.5 * ((color - 0.25) / 0.16)**2)\n",
    ")\n",
    "\n",
    "Teff = Teff_true + rng.normal(0, 110, n)\n",
    "\n",
    "# Create test and training sets and fit with polynomial models and cross validation\n",
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7025246",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot T_eff against color, the true relation, and the different models\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10d7fb0c",
   "metadata": {},
   "source": [
    "### Interpretation\n",
    "\n",
    "TO BE FILLED IN"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "943b74d2",
   "metadata": {},
   "source": [
    "## 3. Weighted regression with heteroscedastic errors\n",
    "\n",
    "### Problem\n",
    "Astronomical measurements often have very different uncertainties from point to point  In this exercise, the precise points follow one underlying relation, but the noisy points are both much less reliable and systematically biased upward at high redshift. We'll compare an unweighted fit with a weighted fit.\n",
    "\n",
    "1. Fit an unweighted regression.\n",
    "2. Fit a weighted regression using inverse-variance weights.\n",
    "3. Compare the fitted slopes and intercepts.\n",
    "4. Compute a weighted chi-squared for both fits.\n",
    "5. Explain why the weighted fit is statistically preferred."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e037e1c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create data set\n",
    "n_precise = 95\n",
    "n_noisy = 45\n",
    "\n",
    "x_precise = np.sort(rng.uniform(0.01, 0.09, n_precise))\n",
    "x_noisy = np.sort(rng.uniform(0.09, 0.18, n_noisy))\n",
    "\n",
    "x = np.concatenate([x_precise, x_noisy])\n",
    "mu_true = 35.5 + 22.0 * x\n",
    "\n",
    "sigma_precise = np.full(n_precise, 0.05)\n",
    "sigma_noisy = np.full(n_noisy, 0.55)\n",
    "sigma = np.concatenate([sigma_precise, sigma_noisy])\n",
    "\n",
    "# Precise low-z points follow the truth closely\n",
    "mu_precise = 35.5 + 22.0 * x_precise + rng.normal(0, sigma_precise)\n",
    "\n",
    "# Noisy high-z points are both noisy and biased upward\n",
    "mu_noisy = 35.5 + 22.0 * x_noisy + 1.1 + rng.normal(0, sigma_noisy)\n",
    "\n",
    "mu_obs = np.concatenate([mu_precise, mu_noisy])\n",
    "\n",
    "X = x.reshape(-1, 1)\n",
    "\n",
    "# Fit unweighted and weighted linear regression models\n",
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "852e4bfa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the data and the weighted and unweighted fits\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2ed28ec",
   "metadata": {},
   "source": [
    "### Interpretation\n",
    "\n",
    "TO BE FILLED IN"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91c0dd72",
   "metadata": {},
   "source": [
    "## 4. Robust regression with outliers\n",
    "\n",
    "### Problem\n",
    "Time series from astronomical surveys often contain a few bad points due to weather, cosmic rays, blending, or reduction failures. In this exercise, the underlying trend is mild, but the outliers are strong enough to bias ordinary least squares.\n",
    "\n",
    "1. Fit ordinary least squares.\n",
    "2. Fit a robust model using Huber regression.\n",
    "3. Compare the inferred slopes.\n",
    "4. Compare the residual scatter after excluding the most extreme points.\n",
    "5. Explain why the robust fit is preferable here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e18b917",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create data set\n",
    "n = 130\n",
    "t = np.linspace(0, 220, n)\n",
    "y_true = 18.3 + 0.0045 * t\n",
    "y = y_true + rng.normal(0, 0.08, n)\n",
    "\n",
    "idx_hi = rng.choice(np.arange(n//2, n), size=7, replace=False)\n",
    "idx_lo = rng.choice(np.arange(0, n//2), size=4, replace=False)\n",
    "idx = np.concatenate([idx_hi, idx_lo])\n",
    "\n",
    "y[idx_hi] += rng.normal(1.8, 0.35, len(idx_hi))\n",
    "y[idx_lo] -= rng.normal(1.4, 0.30, len(idx_lo))\n",
    "\n",
    "X = t.reshape(-1,1)\n",
    "\n",
    "# Fit regression models\n",
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73d99672",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the data and the regression models\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97671fe2",
   "metadata": {},
   "source": [
    "### Interpretation\n",
    "\n",
    "TO BE FILLED IN"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c04389b",
   "metadata": {},
   "source": [
    "## 5. Gaussian process regression for an irregularly sampled light curve\n",
    "\n",
    "### Problem\n",
    "Irregular sampling is common in time domain astronomy. A common way to deal with this is Gaussian process regression. So in this first GPR exercise, we will compare a GP model to a polynomial fit to a simulated stochastic, smooth light curve sampled at irregular times with measurement uncertainties.\n",
    "\n",
    "1. Fit a Gaussian process.\n",
    "2. Plot the predictive mean and uncertainty band.\n",
    "3. Identify the learned amplitude and characteristic timescale from the kernel.\n",
    "4. Compare the GP to a cubic polynomial fit on the same data.\n",
    "5. Estimate the predictive uncertainty near the center of the time series and near a gap or edge.\n",
    "6. Briefly explain why a GP is often preferable for interpolation of irregular astronomical light curves."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94e52ada",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create data set\n",
    "n = 85\n",
    "t_left = np.sort(rng.uniform(0, 42, n//2))\n",
    "t_right = np.sort(rng.uniform(58, 100, n - n//2))\n",
    "t = np.sort(np.concatenate([t_left, t_right]))\n",
    "\n",
    "signal = 0.65 * np.sin(t / 8.0) + 0.28 * np.cos(t / 17.5)\n",
    "err = rng.uniform(0.04, 0.11, n)\n",
    "y = signal + rng.normal(0, err)\n",
    "\n",
    "X = t.reshape(-1,1)\n",
    "\n",
    "# Create a GP Regressor with an appropriate kernel - I suggest a RBF (radial basis function)\n",
    "# Don't forget to also create the polynomial fit\n",
    "\n",
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fc6e98a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the time series and the model fits\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa556555",
   "metadata": {},
   "source": [
    "### Interpretation\n",
    "\n",
    "TO BE FILLED IN"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b630d8b",
   "metadata": {},
   "source": [
    "## 6. Gaussian-process model comparison: periodic vs. non-periodic kernels\n",
    "\n",
    "### Problem\n",
    "Different kernels encode different physical assumptions. In this second GP exercise, we simulate a clearly periodic variable star light curve and compare a smooth non-periodic kernel to a periodic kernel. To make the distinction visible, we use a **blocked train/test split**:\n",
    "\n",
    "- train on the earlier part of the light curve,\n",
    "- test on the later part.\n",
    "\n",
    "This turns the problem into a mild extrapolation task, where the difference between “smooth” and “repeating” should matter much more.\n",
    "\n",
    "1. Simulate a noisy, irregularly sampled periodic light curve.\n",
    "2. Fit one GP with an RBF kernel.\n",
    "3. Fit another GP with a periodic kernel.\n",
    "4. Compare the two models visually and via test-set MSE.\n",
    "5. Inspect the learned period from the periodic kernel.\n",
    "6. Which kernel better captures the physics of the source?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "207f0223",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the data set\n",
    "n = 140\n",
    "t = np.sort(rng.uniform(0, 100, n))\n",
    "period_true = 7.4\n",
    "\n",
    "signal = 1.05*np.sin(2*np.pi*t/period_true) + 0.18*np.cos(4*np.pi*t/period_true)\n",
    "err = rng.uniform(0.04, 0.08, n)\n",
    "y = signal + rng.normal(0, err)\n",
    "\n",
    "X = t.reshape(-1,1)\n",
    "\n",
    "# Create a blocked split: train early, test late\n",
    "train_mask = t < 60\n",
    "test_mask = t >= 60\n",
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "931d2837",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the data and the models\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cd63d4b",
   "metadata": {},
   "source": [
    "### Interpretation\n",
    "\n",
    "TO BE FILLED IN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "562ff1ec",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}